Abstract

Convolutional Neural Networks (CNNs) have been successful
in image recognition tasks, and recent works shed light
on how a CNN separates different classes with the learned
inter-class knowledge through visualization [8, 10, 13]. In
this work, we instead visualize the intra-class knowledge
inside a CNN to better understand how an object class is
represented in the fully-connected layers. To invert the
intra-class knowledge into more interpretable images, we
propose a non-parametric patch prior on top of previous CNN
visualization models [8, 10]. With it, we show how different
“styles” of templates for an object class are organized by
the CNN in terms of location and content, and represented in
a hierarchical and ensemble way. Moreover, such intra-class
knowledge can be used in many interesting applications,
e.g. style-based image retrieval and style-based object
completion.

1. Introduction 

Deep convolutional neural networks (CNNs) [6] achieve
state-of-the-art performance at recognition tasks. Recent
works [12, 1, 11] have focused on understanding the
inter-class discriminative power of CNNs. In particular, [13]
shows that individual neurons in different convolutional
layers correspond to texture patterns with various levels of
abstraction, and that even object detectors can be found in
the last feature extraction layer.

However, little is known about how a CNN represents an
object class or how it captures intra-class variation. For
example, in the object classes “orange” and “pool table”,
there are drastically different “styles” of object instances
that the CNN recognizes correctly (Fig. 1). There are two
main challenges in this problem. One is to visualize the
knowledge numerically instead of directly retrieving natural
images, which can be biased towards the image database
in use. The other is that such intra-class knowledge is
captured collectively by a group of neurons, namely a
“neural pathway”, instead of by a single neuron as studied
in previous works.

In this work, we make progress on both challenges by (1)
introducing a patch prior to improve parametric CNN
visualization models, and (2) analyzing how the spatial and
style intra-class knowledge is encoded inside the CNN in a
hierarchical and ensemble way. With this learned knowledge,
we can retrieve images or complete images in a novel way. Our
techniques apply to a range of feedforward architectures;
here we focus on the CNN [5] trained on the large-scale
ImageNet challenge dataset [4], with 5 convolutional layers
followed by 3 fully-connected layers.

Figure 1: Examples of intra-class variation. We show two
different styles of the object classes “orange” and “pool
table” with the retrieved images and our new visualization
method.

2. Related Work 

Below, we survey works on understanding fully-connected
layers and intra-class knowledge discovery.

2.1. Fully-Connected Layers in CNN 

Understanding. Below are some recent findings about
fully-connected layers. (1) Dropout techniques. [5] considers
the dropout technique as an approximation of learning
ensemble models, and [2] proves its equivalence to a
regularization. (2) Binary code. [ ] discovers that the binary
masks of the features from the fc6-7 layers are good enough
for classification. (3) Pool5. pool5 features contain both
spatial and semantic information about object parts, and we
can combine them by selecting sub-matrices in $W_6$.
(4) Image retrieval from fc7. fc7 is used as a semantic space.

Visualization. Unlike features in convolutional layers, from
which we can recover most of the original image with
parametric [12, 8] or non-parametric methods, features from
fully-connected layers are hard to invert. As shown in [8],
the location and style information of the object parts is
lost. Another work [10] inverts the class-specific feature
from the fc8 layer, which is 0 except for the target class.
The output image from numerical optimization is a composite
of various object templates. Both works follow the same model
framework (compared in Sec. 3.1), which can be solved
efficiently with gradient descent.

[Figure 2 panels: (a) feature inversion; (b) class visualization results; (c) pop-art-style images with similar pool5 features (<1% difference)]

2.2. Intra-class Knowledge Discovery 

Understanding image collections is a relatively unexplored
task, although there is growing interest in this area.
Several methods attempt to represent the continuous variation
in an image class using sub-spaces or manifolds. Unlike these
methods, we investigate discrete, nameable transformations,
like crinkling, rather than working in a hard-to-interpret
parameter space. Photo collections have also been mined for
storylines as well as spatial and temporal trends, and
systems have been proposed for more general knowledge
discovery from big visual data. [9] focuses on physical state
transformations, and in addition to discovering states it
also studies state pairs that define a transformation.

In Sec. 3, we analyze the problem of current parametric
CNN visualization models and propose a data-driven patch
prior to generate images with a natural color distribution.
In Sec. 4, we decompose the fully-connected layers into four
different components, which are shown to capture the
location-specific and content-specific intra-class variation,
or to represent such knowledge in a hierarchical and ensemble
way. In Sec. 5, we first provide both quantitative and
qualitative results for our new visualization methods. We
then apply the learned intra-class knowledge inside the CNN
to organize an unlabelled image collection and to fill in
image masks with objects of various styles.

3. CNN Visualization with Patch Prior 

Below, we propose a data-driven patch prior to improve 
parametric CNN visualization models [8, 10] and we show 
improvement for both cases (Fig. 3b). 

3.1. Parametric Visualization Model 

Figure 2: Illustration of the color problem in [10]. The
results for (a) pool5 feature inversion and (b) class
visualization have unnatural global color distributions. In
(c), we show six output images with different global color
distributions but similar pool5 features, differing by less
than 1% from each other.

We first consider the task of feature inversion [8]. Given
the CNN feature (e.g. pool5) of a natural image, the goal is
to invert it back to an image close to the original. [8] aims
to find an optimal image that minimizes the sum of the data
energy from the feature reconstruction error and a
regularization energy $R(I)$ for the estimation,

$$E_{\mathrm{feat}}(I) = \|\Phi_k(I) - \Phi_0\|_2^2 + R(I), \tag{1}$$


where $\Phi_k$ is the CNN feature from layer $k$ and $\Phi_0$
is the target feature for inversion. Another CNN visualization
task, class visualization [10], follows a similar formulation,
where the goal is to generate an image given the class label
$t$:

$$E_{\mathrm{class}}(I) = -\delta_t^{T}\,\Phi_8(I) + R(I), \tag{2}$$

where $\delta_t$ is the binary vector with only the $t$-th
element set to one.

For the regularization term $R(I)$, the $\alpha$-norm of the
image $\|I\|_\alpha^\alpha$ [10] and the pairwise gradient
$\|\nabla I\|_\beta^\beta$ [8] are used. Unlike low-level
vision reconstruction (e.g. denoising), the data energy from
the CNN is less sensitive to low-frequency image content,
which leads to multiple global optima with unnatural color
distributions. Given the input image (Fig. 2a), we show a
collection of pop-art-style images whose pool5 features
differ by less than 1% from those of the input (Fig. 2c).
These images are generated with [8], initialized from the
input image with shuffled RGB channels. In practice, [10, 8]
initialize the optimization from the mean image with or
without white noise, and the gradient descent algorithm
converges to one of the global optima, whose color
distribution can be far from natural (Fig. 2a-b).
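
The two energies and the color ambiguity described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in [8] or [10]: the image is a 2-D grayscale array, the default exponents `alpha=6.0` and `beta=2.0` are illustrative choices, and `toy_phi` is a hypothetical stand-in for a CNN feature that ignores the global intensity offset by construction.

```python
import numpy as np

def alpha_norm(img, alpha=6.0):
    """||I||_alpha^alpha: sum of |pixel|^alpha over the image."""
    return float(np.sum(np.abs(img) ** alpha))

def gradient_norm(img, beta=2.0):
    """Pairwise-gradient term: sum of (dx^2 + dy^2)^(beta/2) over pixels."""
    dx = img[:-1, 1:] - img[:-1, :-1]   # horizontal neighbor differences
    dy = img[1:, :-1] - img[:-1, :-1]   # vertical neighbor differences
    return float(np.sum((dx ** 2 + dy ** 2) ** (beta / 2.0)))

def regularizer(img, lam_alpha=0.0, lam_beta=0.0):
    """R(I) as a weighted sum of the two terms above."""
    return lam_alpha * alpha_norm(img) + lam_beta * gradient_norm(img)

def feature_energy(img, phi, phi0, **reg):
    """Eqn. 1: squared feature reconstruction error plus R(I)."""
    return float(np.sum((phi(img) - phi0) ** 2)) + regularizer(img, **reg)

def class_energy(img, scores, t, **reg):
    """Eqn. 2 written as a minimization: negative score of class t plus R(I)."""
    return -float(scores(img)[t]) + regularizer(img, **reg)

# Hypothetical toy feature (NOT a CNN): a fixed linear map applied after
# removing the image mean, so a global intensity shift leaves it unchanged.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 16))

def toy_phi(img):
    return A @ (img - img.mean()).ravel()

img0 = rng.random((4, 4))
shifted = img0 + 0.5   # different global "color", same toy feature
```

With this toy feature, `img0` and `shifted` both have (numerically) zero data energy with respect to `toy_phi(img0)`, which mirrors in miniature why a data term that is insensitive to low-frequency content admits several minimizers with different global color.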

3.2. Data-driven Patch Prior 

Figure 3: Illustration of the local optima of pool5 feature
inversion. We show six output images with different global
color distributions but similar pool5 features, differing by
less than 1% from each other. Panels: (a) data-driven patch
prior; (b) our CNN visualization results.

Figure 4: Illustration of the organization of Sec. 4. We
decompose the fully-connected layers into four components,
and each subsection explains how the intra-class knowledge
is captured and represented.

To regularize the color distribution for CNN visualization,
we build an external database of natural patches and
minimize the distance of patches from the output to those
in the database. As the patches from the CNN visualization
models above lack low-frequency components, we calculate the
distance between patches after global normalization w.r.t.
the mean and standard deviation of the whole image,
respectively. Combined with previous regularization models,
our final image regularization model is

$$R(I) = \lambda_\alpha \|I\|_\alpha^\alpha + \lambda_\beta \|\nabla I\|_\beta^\beta + \lambda_\gamma \sum_p \|I_p - D_p\|_2^2, \tag{3}$$

where $\lambda_\alpha$, $\lambda_\beta$, $\lambda_\gamma$ are
the weight parameters for each term and $p$ is the patch
index. $I_p$ are the densely sampled normalized patches and
$D_p$ are the nearest normalized patches from a natural
patch database. In practice, we alternate between solving the
continuous optimization for $I$ given the matched patches
$D_p$ and the discrete optimization for $D_p$ with PatchMatch
given the previous estimate of $I$.
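
The discrete matching step and the patch term of Eqn. 3 can be sketched as follows. This is a rough illustration under stated assumptions: the image is 2-D grayscale, patches are 3 x 3, and a brute-force nearest-neighbor search stands in for PatchMatch. All function names are hypothetical, and the continuous update of $I$ is not shown.

```python
import numpy as np

def normalize(img):
    """Global normalization by the mean and std of the whole image."""
    return (img - img.mean()) / (img.std() + 1e-8)

def dense_patches(img, size=3):
    """All size x size patches of a 2-D image, flattened, in row-major order."""
    h, w = img.shape
    return np.array([img[i:i + size, j:j + size].ravel()
                     for i in range(h - size + 1)
                     for j in range(w - size + 1)])

def match_patches(patches, database):
    """Discrete step: nearest database patch D_p for every patch I_p.

    Brute-force search, used here only as a stand-in for PatchMatch.
    """
    d2 = ((patches[:, None, :] - database[None, :, :]) ** 2).sum(axis=2)
    return database[d2.argmin(axis=1)]

def patch_energy(img, database, size=3):
    """Patch term of Eqn. 3: sum_p ||I_p - D_p||^2 on normalized patches."""
    patches = dense_patches(normalize(img), size)
    matched = match_patches(patches, database)
    return float(np.sum((patches - matched) ** 2))
```

Because patches are compared after global normalization, an image and a globally rescaled or shifted copy of it (e.g. `2.0 * img + 3.0`) receive essentially the same patch energy, which is the property the normalization is meant to provide.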

To illustrate the effectiveness of the patch prior, we
compute the dense patch correspondence from the patch
database to a pool5 feature inversion result [8] (Fig. 3a),
and visualize the warped image, which regularizes the output
image in Eqn. 3. We compare the patch matching quality with
and without normalization. As expected, the normalized
patches have a better chance of retrieving natural patches,
and the warped result is reasonable despite the unnatural
color distribution of the initial estimate.

Below, we describe how to build an effective patch
database. The object class visualization task has no
ground-truth color distribution, so we can directly sample
patches from validation images of the same class. For feature
inversion, however, such an approach can be costly due to the
intra-class variation of each object class, where images from
the same class may not match well. As discovered in [7],
conv-layer features can be used to retrieve image patches
with similar appearance, though their semantics can be
totally different. Thus, we build a database of 1M pairs of
1x1x256 pool5 features and the center 67x67x3 patch of
the original patch support (195x195x3). Given the pool5
feature to invert, we build our patch database with this
retrieval method, and we show that the averaged patches
(10-NN at each pool5 location) recover the color distribution
of the input image well (Fig. 3a).


4. Discover CNN Intra-class Knowledge 

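For the class visualization task [10], notice that the
back-propagated gradient from the CNN (data energy) in
Eqn. 2 is a series of matrix multiplications:

$$\frac{\partial\, \mathrm{fc}_8^{t}}{\partial I} = \left(\frac{\partial\, \mathrm{pool}_5}{\partial I}\right)^{T} \left(W_8^{t}\, m_7 W_7\, m_6 W_6\, m_5\right)^{T}, \tag{4}$$

where $W_8^t$ is the $t$-th row of $W_8$, and $m_5$, $m_6$,
$m_7$ are the relu masks computed on $I$ for pool5 and fc6-7
during the feedforward stage.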

Given the learned weights ($W_{6-8}$), we can turn units on
or off through the masks $m_{5-7}$ to sample different
structures of the fc layers (“neural pathways”) by
multiplying different sub-matrices of the learned weights.
Another view is that the class-specific information is stored
in $W_8$, and it can be decoded by different $W_{6-7}$
structures through the relu masks $m_{5-7}$. [10] uses all the
weights in $W_{6-7}$, which leads to a composite template of
object parts of different styles in all places (Fig. 2b).
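
The masked chain of matrix products can be checked numerically on a toy network. The sketch below is an assumption-laden illustration: it uses small random weights, omits biases, and stops at the pool5 input (the convolutional part $\partial\,\mathrm{pool}_5 / \partial I$ is not modeled). The function names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, W6, W7, W8):
    """Toy fc stack without biases; returns class scores and masks (m5, m6, m7)."""
    m5 = (x > 0).astype(float)
    a6 = W6 @ relu(x)
    m6 = (a6 > 0).astype(float)
    a7 = W7 @ relu(a6)
    m7 = (a7 > 0).astype(float)
    return W8 @ relu(a7), (m5, m6, m7)

def class_gradient(t, masks, W6, W7, W8):
    """Row vector W8[t] m7 W7 m6 W6 m5, with masks acting as diagonal matrices.

    Passing masks other than the ones from `forward` switches units off,
    i.e. selects a different pathway through the fc layers.
    """
    m5, m6, m7 = masks
    g = W8[t] * m7
    g = (g @ W7) * m6
    g = (g @ W6) * m5
    return g
```

Calling `class_gradient` with the masks returned by `forward` reproduces the ordinary back-propagated gradient of the class score with respect to the toy pool5 input, while zeroing entries of a mask removes the corresponding units from the product.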

Below, by controlling the masks $m_{5-7}$, we show that
the CNN captures two kinds of intra-class knowledge (location
and style), which are encoded with an ensemble and
hierarchical representation (Fig. 4).

4.1. Location-variation ($m_5$)

The output from the last convolutional layer is pool5,
which has been shown to be semantically effective as object
detectors [13]. pool5 features (and their relu mask $m_5$)
have a 6 x 6 spatial dimension, and we can visualize an
object class within a certain receptive field (RF) by opening
only a subset of spatial locations (e.g. k x k patches) while
optimizing Eqn. 2.

In Fig. 5, we show that the “terrier” class doesn’t have
much variation across RFs, as it learns the dog head
uniformly. On the other hand, the “monastery” class displays
heterogeneity, as it learns domes at the top of the image,
windows in the middle and doors at the bottom.
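
Opening only a k x k window of the 6 x 6 pool5 grid amounts to building a binary mask. A minimal sketch follows, assuming a height x width x channel layout with 256 channels; the layout and the function name are illustrative assumptions.

```python
import numpy as np

def spatial_mask(top, left, k, size=6, channels=256):
    """Binary pool5 mask that opens a k x k window of spatial locations."""
    if not (0 <= top <= size - k and 0 <= left <= size - k):
        raise ValueError("window must lie inside the pool5 grid")
    m5 = np.zeros((size, size, channels))
    m5[top:top + k, left:left + k, :] = 1.0
    return m5

# Example: open the top-left 2 x 2 locations only.
m5_top_left = spatial_mask(0, 0, 2)
```

Multiplying pool5 activations (or their gradient) elementwise by such a mask restricts the visualization to the chosen receptive field.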

Figure 5: Illustration of location-based variation: different
spatial priors are learned.

4.2. Content-variation ($m_7$)

fc7 has been used as the image semantic space, and it has
been reported to be indicative for image retrieval.

Semantic Space as Convex Cone in fc7. Notice that fc8 is a
linear combination of fc7. Thus, in the fc7 feature space, if
two feature vectors $f_1$ and $f_2$ have the same predicted
top-1 class, then any feature vector $f \in \Omega$ (linear
cone) will have the same top-1 prediction, where

$$\Omega = \{\lambda(\alpha f_1 + (1 - \alpha) f_2) : \lambda > 0,\ \alpha \in [0, 1]\}.$$

Thus, given the training examples Namely if two linear
polytope (NMF) In Fig. 6, we show the clustering results of
the training examples, which capture different poses or
contents of the object; we call these the “style” of the
object.

fc7 Topic Visualization. Given the learned fc7 topic above,
we can apply its relu mask to $m_7$ while optimizing Eqn. 2.
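
The cone property can be checked numerically for a purely linear classifier. The sketch below assumes fc8 has no bias term (with a bias, positive scaling by $\lambda$ need not preserve the prediction), and the helper names are hypothetical.

```python
import numpy as np

def top1(W, f):
    """Top-1 class of a bias-free linear classifier with weight matrix W."""
    return int(np.argmax(W @ f))

def cone_point(f1, f2, lam, alpha):
    """A point lam * (alpha * f1 + (1 - alpha) * f2) of the cone, lam > 0."""
    return lam * (alpha * f1 + (1.0 - alpha) * f2)
```

The reasoning is short: if class $c$ has the largest score for both $f_1$ and $f_2$, then for every other class $j$ the margin $(w_c - w_j)^T f$ of a convex combination is a convex combination of two non-negative margins, and multiplying by $\lambda > 0$ does not change its sign.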

4.3. Ensemble Encoding ($m_{6-7}$)

During training, the dropout trick makes the CNN an ensemble
model by randomly setting 50% of the fc6-7 features to 0,
which is equivalent to turning off half of $m_{6-7}$. Below,
we try to understand what each single $m_{6-7}$ model learns
by reconstructing images according to Eqn. 2. We randomly
sample 2 pairs of $m_{6-7}$ that correspond to different
styles of the objects and reconstruct the image with 2
different random initializations (Fig. 7). Interestingly,
different models capture different styles of the object,
whereas the variation across random initializations has a
smaller effect on the style.
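
Sampling one member of the dropout ensemble comes down to drawing a pair of binary masks. A minimal sketch, assuming 4096 units in each of fc6 and fc7 and a keep probability of 0.5; the function name and defaults are illustrative.

```python
import numpy as np

def sample_dropout_masks(rng, dims=(4096, 4096), keep_prob=0.5):
    """One binary mask per fc layer (fc6, fc7); 1 keeps a unit, 0 turns it off."""
    return [(rng.random(d) < keep_prob).astype(float) for d in dims]

# Two independently sampled ensemble members.
member_a = sample_dropout_masks(np.random.default_rng(0))
member_b = sample_dropout_masks(np.random.default_rng(1))
```

Each sampled pair would then be multiplied into the relu masks of fc6 and fc7 while optimizing Eqn. 2, so that only the surviving units contribute to the reconstruction.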

4.4. Hierarchical Encoding ($m_{5-7}$)

Figure 7: Visualization of ensemble encoding. We show
that different dropout models capture different aspects of a
class in terms of (a) pose, (b) species, (c) spatial layout,
and (d) scale.

Given an image, we can define its binary code by its relu
masks $m_{5-7}$. [ ] discovers that these binary codes achieve
classification results similar to those of their
corresponding features. Similar to dropout model
visualization, we invert the hash code by masking the weight
matrices $W_{6-7}$ with these binary hash codes, namely
constraining the CNN to generate images only from these
binary masks. We define three different binary hash code
representations for an image with an increasing amount of
constraints: $\{m_7\}$, $\{m_{6-7}\}$, $\{m_{5-7}\}$. During
optimization, we replace $(W_7, W_6)$ with $(m_7 W_7, W_6)$,
$(m_7 W_7 m_6, W_6)$ and $(m_7 W_7 m_6, m_6 W_6 m_5)$ in
Eqn. 2, respectively.
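
The three levels of constraint can be written as row and column masking of the weight matrices. This is a sketch under the assumption that each mask acts as a diagonal 0/1 matrix, so a left product zeroes rows and a right product zeroes columns; the function name and the level labels are hypothetical.

```python
import numpy as np

def constrained_weights(W6, W7, m5, m6, m7, level):
    """Return the (W7, W6) replacement for the codes {m7}, {m6-7} or {m5-7}."""
    w7_rows = m7[:, None] * W7                 # m7 W7: keep fc7 units in the code
    if level == "m7":
        return w7_rows, W6
    w7_both = w7_rows * m6[None, :]            # m7 W7 m6: also restrict fc6 inputs
    if level == "m6-7":
        return w7_both, W6
    if level == "m5-7":
        w6_both = m6[:, None] * W6 * m5[None, :]   # m6 W6 m5
        return w7_both, w6_both
    raise ValueError("level must be 'm7', 'm6-7' or 'm5-7'")
```

Each level zeroes strictly more entries than the previous one, which matches the increasing amount of constraints described above.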
Figure 8: Illustration of hierarchical encoding.

Figure 6: Visualization of various topics learned by CNN with retrieved images and our visualization. These templates
capture intra-class variation of an object class: (a) scale, (b) angle, (c) color, (d) status and (e) content.


5. Experiments 

5.1. CNN Visualization Comparison 

For CNN feature inversion, we provide a qualitative
comparison with the previous state of the art [8], and our
results look more natural with the help of the patch-prior
regularization (Fig. 9a). For quantitative results, we
collect 100 images from different classes in the validation
set of ImageNet. We use the same parameters for both [8] and
ours, where the only difference is our patch-prior
regularization. In addition, we empirically found that
whitening the image as a pre-processing step helps to improve
the image regularization without much trade-off in the
feature reconstruction error. As the error metric, we use the
relative L2 distance between the input image $I_0$ and the
reconstructed image $I$, $\|I_0 - I\|_2 / \|I_0\|_2$. We
compare our algorithm with two versions of [8]: initialized
from white noise ([8]+rand) or from the same patch database
as ours ([8]+[7]). As shown in Table 1, ours achieves a
significant improvement. Notice that, with the whitening
pre-processing and the recommended parameters [8], most runs
have a feature reconstruction error <1%, and here we focus on
which estimate is closer to the ground truth.
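
The error metric above is a one-liner in NumPy. The function name below is hypothetical, and images are treated as flattened arrays of any shape.

```python
import numpy as np

def relative_l2_error(original, reconstructed):
    """||I_0 - I||_2 / ||I_0||_2 over all pixels and channels."""
    i0 = np.asarray(original, dtype=float).ravel()
    i = np.asarray(reconstructed, dtype=float).ravel()
    return float(np.linalg.norm(i0 - i) / np.linalg.norm(i0))
```

A perfect reconstruction gives 0, and an all-zero reconstruction gives 1, so the values in Table 1 can be read as a fraction of the input image's norm.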

For the class visualization task, as there is no ground truth
for sampling images from a given class, we provide more
qualitative results for different kinds of objects (animal,
plant and man-made) in Fig. 9b. Compared to [10], our
visualization results are closer to natural images and are
easier to interpret.

5.2. Image Completion with Learned Styles 

Given the mask of an image, we here show qualitative
results on object insertion and modification to explore
the potential usage of such object-level knowledge for
low-level vision with its top-down semantic understanding of
the image.

Figure 9: Qualitative comparison for CNN visualization.
We compare (a) CNN feature (pool5) inversion with [8]
and (b) CNN class visualization with [10].

Object insertion from context. Given a scene image
(Fig. 10a), [3] can only fill in grass texture due to the
lack of top-down image understanding. A CNN, on the other
hand, can predict relevant object class labels due to their
co-occurrence in training images. For this example, the
top-1 prediction for the grassland image is “Hay”. Our goal
here is to inpaint the hay objects with different styles.

Method   [8]+rand   [8]+[7]   Ours
Error    0.51       0.45      0.32

Table 1: Quantitative comparison of pool5 feature inversion
methods. Conditioned on the feature reconstruction error
being less than a threshold, we compare the distance of the
estimated image from the original input. Our method
outperforms the previous state of the art [8] with two
different initializations.

We first follow Sec. 4.2 to learn the styles of hay objects
from the ImageNet validation data. We visualize each topic
with a natural image retrieved by it in the top row
(Fig. 10a); the topics correspond to different scales of the
hay. Given an fc7 style topic, we can insert objects into the
image with a procedure similar to our fc7 topic
visualization, where only pixels inside the mask are updated
with the gradient from Eqn. 2. In the second row, we see that
different styles of hay are blended with the grassland
(Fig. 10a).
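
Restricting the update to the masked pixels can be sketched as a single masked gradient-descent step. The step size and the function name are illustrative assumptions, and `grad` stands for the gradient of the Eqn. 2 energy with respect to the image, computed elsewhere.

```python
import numpy as np

def masked_step(img, grad, mask, step=0.1):
    """One gradient-descent update applied only where mask == 1.

    Pixels with mask == 0 keep their original values.
    """
    return img - step * mask * grad
```

Repeating this step while recomputing the gradient leaves the surrounding scene untouched and lets only the region inside the mask move towards the chosen style.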

Object Modification. Besides predicting the object class
based on context information, a CNN can locate the key parts
of an object by finding the regions of pixels with
high-magnitude gradient $\partial\, \mathrm{fc}_8 / \partial I$
[10]. Given an input image of a Persian cat (Fig. 10b), we use
simple thresholding and hole filling to find the support of
its key part, the head. Instead of filling the mask with fur
as PatchMatch does, the CNN predicts the masked image as
“Angora” based on the fur information from the body.
Following a similar procedure as above, we first find three
styles of angoras, which correspond to different sub-species
with different physical features (e.g. head color),
visualized with retrieved images in the third row. Our object
modification result is shown in the bottom row, which changes
the original Persian cat in an interesting way. Notice that
the whole object modification pipeline here is automatic and
we only need to specify the style of the angora, as the mask
is generated from the key object part located by the CNN.

6. Conclusion 

In this work, we analyze how a CNN models the intra-class
variation for each object class in fully-connected layers
through an improved visualization technique. We find that the
CNN not only captures the location-variation and
style-variation, but also encodes them in a hierarchical and
ensemble way.